model tuning and stability
interpretability
application to case study
\(\lambda\) is a hyperparameter that controls model complexity.
As \(\lambda \uparrow \infty\), all coefficients shrink toward zero. As \(\lambda \downarrow 0\), we return to linear regression.
We can study the order that variables enter the model as we gradually decrease \(\lambda\). We expect that the most important features enter first.
Q: Can you tell if these coefficient paths \(\hat{\beta}_{j}\left(\lambda\right)\) come from ridge vs. lasso regression?
What is an appropriate “budget” for model complexity?
Decreasing \(\lambda\) allows many noise features into the model. Increasing \(\lambda\) misses true signals.
We need to tune \(\lambda\) to balance these competing issues and achieve good performance on new samples.
Split the data into \(K\) folds (e.g., \(K = 5\)). Fit the model on \(K−1\) folds and tested on the remaining fold. This mimics the setting of gathering new data and testing the model on that data.
We can compute performance on the holdout folds across \(\lambda\) complexity parameters. For each \(k = 1, \dots, K\), we compute
\[\begin{align*} CV_{k}\left(\lambda\right) = \frac{1}{\left|I_{k}\right|} \sum_{i \in I_{k}} \left(y_{i} - \hat{y}_{i}^{-k}\left(\lambda\right)\right)^2 \end{align*}\] where \[\begin{align*} \hat{y}_{i}^{-k}\left(\lambda\right) := x_{i}^\top \hat{\beta}^{-k}\left(\lambda\right) \end{align*}\] and \(\hat{\beta}^{-k}\left(\lambda\right)\) solves the lasso optimization at hyperparameter \(\lambda\) using data from all folds except \(I_{k}\).
The initial decrease is the phase where removing noise features improves performance. The later increase happens when we start ignoring real signal features.
\(\lambda_{\text{min}}\): Minimizes the cross-validation error.
\(\lambda_{\text{1se}}\): The simplest model, i.e., largest \(\lambda\), whose error is within one standard error of the minimum. This is a sparser model with comparable performance to the best.
Any statistical analysis depends on the particular data observed. Different data would yield different results.
The bootstrap estimates this variability without collecting new data.
In regularized regression, we quantify uncertainty in both predictions \(\hat{y}_i\) and coefficients \(\beta_k\).
Simulate new datasets by resampling the original data with replacement. For each bootstrap iteration \(b = 1, \dots, B\):
\[\begin{align*} \mathbf{y}^{b} = \begin{pmatrix}y_{1}^{b} \\ \vdots \\ y_{n}^{b}\end{pmatrix}, \quad \mathbf{X}^{b} = \begin{pmatrix}x_{1}^{b} \\ \vdots \\ x_{n}^{b}\end{pmatrix} \end{align*}\]
Response (\(y\)): Cell viability after treatment with the drug Ibrutinib.
Predictors (\(X\)): High-dimensional molecular profiles \(N = 121, J = 9553\).
Goal: Identify a sparse set of features that distinguish drug sensitive vs. resistant cells.
We use the glmnet package in R:
# Fit the lasso path and perform 10-fold CV
cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10, standardize = TRUE)This solves the optimization problem for a sequence of \(\lambda\) values and estimates cross-validation error across \(K = 10\) folds.
alpha = 1 is lassoalpha = 0 is ridgeRespond to [Evaluation Critique] in the exercise sheet.
It’s important to transform \(x_{ij} \to \frac{x_{ij} - \bar{x}_j}{\hat{\sigma}_j}\).
The penalty \(\lambda \sum |\beta_j|\) treats all \(\beta_j\) equally. If \(x_j\) has a large scale, its \(\beta_j\) will be small, making it “cheaper” for the penalty to keep it in the model.
We can visualize how the lasso coefficients \(\hat{\beta}_j\) evolve as \(\lambda\) decreases.
Each line represents a gene (RNA or methylation feature). The order in which they “emerge” from zero indicates their relative importance in predicting drug response.
Show how to render quarto files.
Check that coefficients stay at exactly 0 for many \(\lambda\).
Compare results from lasso vs. ridge regression.
When predictors are highly correlated, the \(\hat{\beta}\) estimated by lasso becomes unstable.
Small changes in the training data can cause the lasso to switch between two highly correlated genes, leading to conflicting interpretations of the same underlying signal.
We simulated data where \(\text{Cor}\left(x_{2}, x_{3}\right) > 0.95\). The true coefficients are
These are the feature selection frequencies across 1000 bootstrap iterations. Genes 1 and 2 compete for selection.
Selection \(\neq\) Importance
More generally, we should be careful applying sparse models to correlated data. There may be many plausible, equally predictive explanations based on different subsets of features.
Often, people assume that to get higher accuracy, we need “black-box” models (like deep learning or Random Forests) and sacrifice interpretability.
But in many scientific problems (like today’s case study, see also (“Benchmarks - Open Problems in Single-Cell Analysis — Openproblems.bio”)), sparse linear models are surprisingly competitive, while remaining more interpretable to more audiences. It depends on the true relationship between predictors and response in the data.
Respond to [Code Analysis] in the exercise sheet.